Week 9 Exercise: Time Series Data and Analysis Strategies
Author
Chris Zuidema, Elena Austin, and Instructors for ENVH 556 Autumn 2024
This document was rendered on November 25, 2024.
Introduction and Purpose
The purpose of this lab is to practice working with time series data, develop facility in importing and merging data of varying time scales, apply smoothing approaches, and use some analysis tools to reduce the complexity of these data. In addition, as with all the labs in this course, you should focus on writing up a coherent lab report that addresses the scientific context and scientific questions.
In this lab exercise, we will use data collected as part of the MOV-UP mobile monitoring campaign. These data were collected on 1-second and 10-second time scales. The lab will walk you through the steps of importing the raw instrument data. The MOV-UP study collected data from the following instruments:
GPS tracker
particle counter (P-Trak, model 8525, TSI Inc., MN)
condensation particle counter (CPC, model 3007, TSI Inc., MN)
black carbon aethalometer (microAeth AE51)
particle sizing spectrometer (NanoScan)
carbon dioxide (CO2) (SenseAir K-30-FS)
The full suite of instruments on the monitoring platform used in this study, and their characteristics, can be found in Riley et al., 2014, Larson et al., 2017, and Xiang et al., 2020. The remainder of the code in this week’s lab uses the GPS, P-Trak, and CO2 measurements.
Dates and Times
Managing dates and times in exposure assessment is a common challenge because there are a variety of common date and time formats used to log instrument data. You’ve no doubt encountered the following representations of dates:
25 Nov, 2024
11/25/2024
2024-11-25
With direct-reading instruments made all over the world, you’ll have the potential to encounter many devices that use different date and time formats. You’ll find it advantageous to convert everything to a standardized format, including the time zone. Consistently using a standardized date and time format will make it less likely that you’ll interpret a date and time incorrectly.
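As a quick base-R sketch (the date strings here are hypothetical examples), each of the formats above can be parsed into the same standardized Date object by supplying a matching format string:

```r
# parse three representations of the same date into the Date class
# (note: "%b" matches abbreviated month names and is locale-dependent)
d1 <- as.Date("25 Nov, 2024", format = "%d %b, %Y")
d2 <- as.Date("11/25/2024", format = "%m/%d/%Y")
d3 <- as.Date("2024-11-25")  # ISO 8601 is the default format

# all three are now the identical standardized date
identical(d1, d2) && identical(d2, d3)
```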
Date objects in R are of class Date, and compound date and time (“datetime”) objects are of class POSIXct. Let’s take a look at these object types and show their classes:
#-----datetime and date objects-----

# show POSIXct object -- a datetime object with both date and time
Sys.time()
[1] "2024-11-25 10:42:56 PST"
# show class
# Note that 'POSIXt' is a virtual class which cannot be used directly; both
# POSIXct and POSIXlt inherit from it. POSIXt exists to allow operations,
# such as subtraction, to mix the two classes.
class(Sys.time())
[1] "POSIXct" "POSIXt"
# show Date object -- date without time included
Sys.Date()
[1] "2024-11-25"
# show class
class(Sys.Date())
[1] "Date"
Internally, date and datetime objects store the number of seconds (for POSIXct) or days (for Date) since January 1st, 1970, at midnight UTC. So for the current date and time, these values are about 1.73 × 10^9 seconds and 2.01 × 10^4 days. When printed to the console, they are displayed in a human-readable format following the International Organization for Standardization (ISO) 8601 date and time standard. ISO 8601 provides an unambiguous, well-known, and well-defined method of representing dates and times.
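You can check this internal representation directly with as.numeric() (a small base-R illustration):

```r
# Date objects count whole days since the 1970-01-01 origin
as.numeric(as.Date("1970-01-02"))                          # one day -> 1

# POSIXct objects count seconds since 1970-01-01 00:00:00 UTC
as.numeric(as.POSIXct("1970-01-01 00:01:00", tz = "UTC"))  # one minute -> 60
```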
It is often necessary to define date/time components or the format of a datetime object specifically for each instrument. There are several reasons in practice why we need to pay careful attention to this:
The time scale of each instrument can be different.
The internal instrument clocks can be different from each other and/or the current time.
The instruments may be using different time zones, daylight savings time conventions, or there may be issues with simultaneous measurements on either side of the international date line.
Regardless, it is necessary to ensure a common schedule prior to merging the data. Accounting for these issues during the initial data wrangling will make it easier to handle date and datetime variables throughout your analysis process.
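For example, a single instant can be displayed in any time zone without changing the underlying value (a base-R sketch; lubridate's with_tz() and force_tz() offer similar conveniences):

```r
# one instant, recorded in UTC
t_utc <- as.POSIXct("2018-07-19 20:00:00", tz = "UTC")

# the same instant displayed in the instrument's local time zone
format(t_utc, tz = "US/Pacific", usetz = TRUE)  # "2018-07-19 13:00:00 PDT"

# the underlying number of seconds since 1970 is unchanged by the display
as.numeric(t_utc) == as.numeric(as.POSIXct("2018-07-19 13:00:00", tz = "US/Pacific"))
```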
Identify Files of Interest, Read Files & Wrangle
Here, we’ll focus on CO2 data from Car 1 (there were 2 cars); for your assignment, we ask you to analyze data from P-Traks, either using the same car or the other car.
In each of the following chunks, we show the tibble. In the .html file, you can scroll through the first 10,000 rows. However, don’t assume that 10,000 is the maximum number of observations in the dataset. To learn that, use glimpse or str.
#-----gather file names for import-----

# get filename paths for data files
filenames <- list.files(data_dir) %>% str_subset("car1")
GPS
#-----gps-----

# get GPS file name
gps_file <- str_subset(filenames, "GPS")

# read GPS data
gps <- read_csv(file.path(data_dir, gps_file)) %>%
  # set column names
  set_names(c("date", "time", "latitude", "longitude", "altitude", "speed")) %>%
  # create datetime column (notice we must specify the timezone)
  mutate(datetime = as.POSIXct(paste(date, time), tz = "US/Pacific"))

# show dataframe
gps
These instruments measure ultrafine particle (UFP) concentrations at 1-second time resolution. In order to capture the change in ultrafine particles, two ultrafine particle counter (P-Trak) instruments were used. One instrument was operated under standard conditions and reported the total particle number concentration (in units of particles/cm3, #/cm3, pt/cm3, or 1/cm3) for all particles with diameters between 20 nm and 1 um. The second P-Trak operated using a screened inlet. The mesh size and the number of mesh layers were selected to intercept the smallest particles as they entered the system, raising the lower particle size cutoff to 36 nm. The difference between the unscreened and screened P-Trak instruments provided an estimate of the concentration of 20-36 nm particles. This was verified using a Scanning Mobility Particle Sizer spectrometer.
#-----p-trak-----

# get p-trak not-screened file name
ptrak_noscreen_file <- filenames %>%
  str_subset("PT") %>%
  str_subset("scrnd|Scrnd|Screened|screen", negate = TRUE)

# read data
ptrak_noscreen <- read_tsv(file.path(data_dir, ptrak_noscreen_file), skip = 29) %>%
  # name columns
  set_names(c("date", "time", "ptrak_noscreen_conc")) %>%
  # modify date column and create datetime column (notice we must specify the
  # date format)
  mutate(date = as.Date(date, format = "%m/%d/%Y"),
         datetime = as.POSIXct(paste(date, time), tz = "US/Pacific")
         )

# get p-trak-screened file name
ptrak_screen_file <- filenames %>%
  str_subset("scrnd|Scrnd|Screened|screen")

# read data
ptrak_screen <- read_tsv(file.path(data_dir, ptrak_screen_file), skip = 29) %>%
  # name columns
  set_names(c("date", "time", "ptrak_screen_conc")) %>%
  # modify date column and create datetime column
  mutate(date = as.Date(date, format = "%m/%d/%Y"),
         datetime = as.POSIXct(paste(date, time), tz = "US/Pacific")
         )

# show dataframes - have PNC measurements every second
ptrak_noscreen
#-----co2-----

# get CO2 file name
co2_file <- str_subset(filenames, "CO2")

# read data (tab separated values), specify time as character because of time
# parsing issue
co2 <- read_tsv(file.path(data_dir, co2_file), skip = 1,
                col_types = cols(`System_Time_(h:m:s)` = col_character())
                ) %>%
  # name columns
  setNames(c("date", "time", "co2_conc", "h2o_conc", "temp", "pressure",
             "co2_absorp", "flow")
           ) %>%
  # select columns of interest
  select(date, time, co2_conc) %>%
  # address time parsing (some minute values have only one digit) and
  # create datetime column
  mutate(time = as_hms(time),
         datetime = as.POSIXct(paste(date, time), tz = "US/Pacific")
         )

# show dataframe - have CO2 measurements every second
head(co2)
The purrr package from tidyverse is a functional programming tool that allows you to apply a function iteratively over a list. We use that next to merge (or “join”) the datasets. Note that the full_join function searches for common column names and joins on these. Alternatively you can add this argument: by = c("date", "time", "datetime").
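The iterative-join pattern can be sketched with base R's Reduce() and merge() on toy dataframes (the column names and values here are hypothetical):

```r
# three toy instrument dataframes sharing a "time" key
gps_toy <- data.frame(time = 1:3, lat = c(47.60, 47.61, 47.62))
pnc_toy <- data.frame(time = 2:4, pnc = c(12000, 15000, 9000))
co2_toy <- data.frame(time = 3:5, co2 = c(420, 480, 455))

# iteratively full-join (outer merge) each dataframe onto the accumulated result
merged <- Reduce(function(left, right) merge(left, right, by = "time", all = TRUE),
                 list(gps_toy, pnc_toy, co2_toy))

# all 5 time points are kept; NA marks times an instrument did not report
merged
```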
We print out the dataset to get a basic idea of whether the merge worked. Ask yourself questions like the following:
Are there missing data and what is its pattern?
Do the variables (columns) in the final dataset make sense to you?
We observe that early rows of the dataset have missing GPS and CO2 data, so we know this is something we need to investigate further.
#-----merge datasets-----

# create a list of all data to merge
instrument_list <- list(gps = gps,
                        ptrak_noscreen = ptrak_noscreen,
                        ptrak_screen = ptrak_screen,
                        co2 = co2)

# use `purrr::reduce` to run `full_join()` on listed items
all_data <- reduce(instrument_list, full_join) %>%
  # reorder columns so datetime is first
  relocate(datetime) %>%
  # arrange data in datetime order
  arrange(datetime)
Joining with `by = join_by(date, time, datetime)`
Joining with `by = join_by(date, time, datetime)`
Joining with `by = join_by(date, time, datetime)`
# use glimpse to learn more about the data
glimpse(all_data)
Let’s make a quick visualization of the concentration data to get an idea of the data we’re working with.
What features do you see?
#-----plot data-----

# note that these data are all for one day
plot_date <- first(all_data$date)

# transform data from wide to long for ggplot
all_data %>%
  pivot_longer(cols = contains("_conc"),
               names_to = "pollutant",
               values_to = "concentration") %>%
  # pipe into ggplot
  ggplot(aes(x = datetime, y = concentration)) +
  # plot each pollutant in a facet
  facet_wrap(~pollutant, ncol = 1, scales = "free_y") +
  # specify type of plot
  geom_line() +
  # specify theme
  theme_bw() +
  labs(title = plot_date)
Data Availability
Data availability plots can be a useful initial check to visualize what data we have across instruments over time. We show one variable from each dataset. In this case the plot is not very informative: the data are fairly complete, and the plot does not make it easy to see the limited amount of missing data, even though we already observed missing values after merging.
#-----data availability plot-----

# transform data from wide to long for ggplot (using latitude for GPS)
all_data %>%
  mutate(GPS = latitude) %>%
  pivot_longer(cols = c(contains("conc"), GPS), names_to = "measurement") %>%
  # make plot
  ggplot(aes(x = datetime, y = measurement, color = factor(measurement))) +
  geom_point(size = 0.5, alpha = 0.4) +
  labs(x = "Time", y = "Instrument") +
  theme_bw() +
  theme(legend.position = "none")
Missing Data
Next we try to precisely determine how much missing data we have and what are the likely sources of missingness in our data. First we calculate the total and percent missing for every variable in the dataset. We observe that the presence of missing data depends upon the instrument.
Our data appeared to be on a similar 1-second time interval. Let’s confirm by calculating the difference in time between consecutive measurements in each of our datasets:
#-----time stamps-----

# we'll use the instrument data list
lapply(instrument_list, function(i){
  # get vector of datetimes
  datetimes <- i[["datetime"]]
  # get a vector of the differences between consecutive measurements
  time_diff <- datetimes - lag(datetimes)
  # summarize differences
  table(time_diff)
  }) %>%
  # use list names
  set_names(names(instrument_list))

# Most measurements occur every second.
# Note differences in the GPS - a couple of measurements appear to be
# duplicates. A handful of measurements have larger gaps.
It seems that some of our missingness is due to instrument reporting. Some measurements did not occur at 1-second intervals. From this we can also see there are duplicate measurements for some instruments (difference of zero seconds).
Lastly, it appears that the GPS instrument was not turned on at the same time as the other instruments (there is no GPS data at the beginning of the all_data dataframe). This may be due to the fact that the other instruments were turned on for “warm-up” before the measurement campaign started.
Let’s drop data missing the GPS variables for convenience. But is this the best thing to do under all circumstances and for all variables?
Time series data are marked by measurements that are indexed to a time component. There are many R formats for time series data: ts, xts, data.frame, data.table, tibble, zoo, tsibble, tibbletime, and timeSeries. The package tsbox has many useful functions for converting between these time series formats.
We’ve focused most of our effort on tidyverse tools this term, so let’s concentrate on the tsibble, feasts, and slider package functions.
First we want to turn our dataset into a tsibble. We use the function try so that knitting our document doesn’t fail if this operation fails. We observe that it does fail because there were duplicates, so then we inspect them.
#-----try tsibble-----

# # use "try" here so document knits
# try( as_tsibble(all_data, index = datetime) )

# inspect duplicate rows
duplicates(all_data, index = datetime)
Let’s also fill the gaps in time. In other words, we have decided in this analysis that the index column datetime needs to be unique and complete. In standard time series analyses the time increment between every consecutive observation is the same. Adding these new time rows introduces “explicit” NAs to the data, so you’ll then have a choice about whether and how to fill in the non-time data. tidyr::fill() can help with this task, and the choice of how to fill may be more or less important depending on the time scale you’re working on, the amount of missingness, and whether there are consecutive missing measurements. Given the small amount of missing data and the one-second time scale, it’s not unreasonable to introduce a few rows with NAs here.
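The “fill down” behavior of tidyr::fill(..., .direction = "down") can be sketched in base R with a last-observation-carried-forward loop (toy values, for illustration only):

```r
# toy concentration series with explicit NAs (like those created by fill_gaps())
x <- c(NA, 420, NA, NA, 455, 460)

# carry the last observed value forward into each NA
filled <- x
for (i in seq_along(filled)[-1]) {
  if (is.na(filled[i])) filled[i] <- filled[i - 1]
}

filled  # the leading NA stays NA because nothing precedes it
```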
#-----fill ts-----

# count rows before filling
paste("number of rows before filling gaps:", nrow(ts_data))
[1] "number of rows before filling gaps: 15026"
# save for calculation below
temp <- ts_data

# Turn implicit missing values into explicit missing values (add NAs to data)
ts_data <- fill_gaps(ts_data)

# # if you wanted to "fill" the new explicitly missing data with a value drawn
# # from the dataset, here is an option (add a `%>%` to the function above).
# # Note this could have downstream effects.
# fill(co2_conc, .direction = "down")

# count rows after filling
paste("number of rows after filling gaps:", nrow(ts_data))
[1] "number of rows after filling gaps: 15070"
# see what rows were added - these contain NA values other than dates
differences_df <- anti_join(ts_data, temp)
Joining with `by = join_by(datetime, date, time, latitude, longitude, altitude,
speed, ptrak_noscreen_conc, ptrak_screen_conc, co2_conc)`
head(differences_df)
# A tsibble: 6 x 10 [1s] <US/Pacific>
datetime date time latitude longitude altitude speed
<dttm> <date> <time> <dbl> <dbl> <dbl> <dbl>
1 2018-07-19 13:00:01 NA NA NA NA NA NA
2 2018-07-19 13:07:59 NA NA NA NA NA NA
3 2018-07-19 13:09:21 NA NA NA NA NA NA
4 2018-07-19 13:09:22 NA NA NA NA NA NA
5 2018-07-19 13:09:23 NA NA NA NA NA NA
6 2018-07-19 13:09:24 NA NA NA NA NA NA
# ℹ 3 more variables: ptrak_noscreen_conc <dbl>, ptrak_screen_conc <dbl>,
# co2_conc <dbl>
paste("number of rows added:", nrow(differences_df))
[1] "number of rows added: 44"
Temporal Autocorrelation
Time series data are often correlated in time: measurements taken close together are more similar than measurements taken farther apart. An autocorrelation plot helps identify at what lag (or time interval) data are less correlated. In the following plot, the blue lines indicate bounds on the autocorrelation of these data, i.e., which autocorrelations are statistically different from zero. At what lag do the CO2 concentrations appear less correlated?
The Autocorrelation Function (ACF) measures the overall correlation between a time series and its lagged values, including both direct and indirect effects from intermediate lags. In contrast, the Partial Autocorrelation Function (PACF) isolates the direct relationship between a time series and a specific lag, removing the influence of intervening lags. Use the ACF to understand the general structure of dependencies in a time series, and the PACF to determine the number of significant lags for autoregressive models.
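The same quantities can be computed numerically with base R's acf() and pacf() (a sketch using a simulated AR(1) series, not the lab data):

```r
# simulate an autocorrelated (AR(1)) series: each value leans on its predecessor
set.seed(556)
x <- as.numeric(arima.sim(model = list(ar = 0.8), n = 500))

# ACF: total correlation at each lag, including indirect paths through
# intermediate lags; decays gradually for an AR(1) process
acf_vals <- acf(x, lag.max = 5, plot = FALSE)$acf

# PACF: direct correlation only; for an AR(1) process, only lag 1 is large
pacf_vals <- pacf(x, lag.max = 5, plot = FALSE)$acf
```

For this simulated series the ACF stays high for several lags while the PACF drops toward zero after lag 1, which is exactly the distinction described above.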
#-----autocorrelation-----

# autocorrelation plot - shows the overall correlation structure across lags
## calculates correlation between a conc and its lagged versions (both
## immediate/direct and indirect)
## ACF measures the linear (Pearson) correlation between a time series and its
## lagged versions in a tsibble.
## plot: a high positive correlation suggests that the values are similar
## between time points separated by the lag; values near zero indicate no
## correlation at that lag.
## The blue dashed lines mark the approximate 95% bounds expected if no
## autocorrelation exists in the data.
ts_data %>%
  ACF(co2_conc,     # column used for computation
      lag_max = 60  # maximum lag at which to calculate the acf
      ) %>%
  autoplot()
# partial autocorrelation plot
## shows only the correlation between a conc (e.g. at time t) and a lagged conc
## (e.g. at time t-2), adjusting for the influence of intermediate conc lags
## (e.g. at time t-1).
## The first lag (lag 1) has a high and statistically significant partial
## autocorrelation (above the blue dashed line), indicating that the immediate
## prior value of the time series has a strong direct influence on the current
## value.
## After lag 1 (and possibly lag 2 or 4), the partial autocorrelations become
## less significant (fall within the blue dashed lines). Beyond the first lag,
## there is thus minimal direct influence of earlier values on the current
## value once the effects of lag 1 (and other significant lags) are accounted
## for.
ts_data %>%
  PACF(co2_conc, lag_max = 60) %>%
  autoplot()
Temporal Smoothing with New Time Scales – Moving Averages
A common task with time series data is averaging to different time scales. From the autocorrelation plot above, we can see that at a lag of about 60 seconds, the CO2 concentrations are no longer correlated. Let’s convert our 1-second data to some longer time scales. One package that helps with averaging tasks is slider.
For the first average we’ll use the time scale period (minutes) to calculate 1-minute averages. Though with other data, this could also be months, weeks, years, etc. This is a “block average” where the period we are averaging over is based on an attribute of the data. For example, if you wanted monthly averages, this method would be “aware” of the fact there are a different number of days in each month. For the second average (a 2-minute average) we’ll demonstrate a moving (“sliding” or “rolling”) average, where the number of neighbors before and after a value of interest are used to compute an average.
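The distinction between the two averaging approaches can be sketched in base R with a toy vector (slider generalizes both ideas to datetime-aware windows on dataframes):

```r
# ten 1-second values spanning two toy "minutes"
conc   <- c(400, 402, 404, 406, 408, 500, 502, 504, 506, 508)
minute <- rep(1:2, each = 5)

# block average: one summary value per fixed period
block_avg <- tapply(conc, minute, mean)   # 404 for minute 1, 504 for minute 2

# moving average: a centered 3-point window, one (smoothed) value per input
moving_avg <- as.numeric(stats::filter(conc, rep(1/3, 3), sides = 2))
# the window extends past the ends of the series, so the first and last are NA
```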
Notice how both functions preserve the input dataframe size. For the period average we can then easily merge the new column into our existing dataframe (however, you’ll notice values are repeated 60 times for each minute). The moving average works nicely with mutate().
slider
#-----slider-----

ts_data <- ts_data %>%
  # left join new 1-minute period values
  left_join(
    # Approach 1: aggregate data to 1 min averages using fixed period intervals
    ## returns a dataframe with repeated 1 min avg measures every second
    slide_period_dfr(ts_data, ts_data$datetime, "minute",
                     ~tibble(datetime = floor_date(.x$datetime),
                             co2_conc_one = mean(.x$co2_conc)
                             )
                     ),
    by = "datetime"
    ) %>%
  # add a datetime column with 1-minute precision to help if you choose to
  # `group_by()` and `summarise()` 1-minute averages
  mutate(datetime_one = ymd_hm(format(datetime, "%Y-%m-%d %H:%M"),
                               tz = "US/Pacific")) %>%
  # Approach 2: calculate moving averages, e.g., 2 minutes, to provide a
  # smoothed version of the time series (this works because we've made sure
  # our datetime index is unique and complete)
  mutate(co2_conc_two = slide_dbl(co2_conc, ~mean(.x, na.rm = TRUE),
                                  .before = 60, .after = 60)
         )

# # slide_period also works in the same way, but it returns a vector of
# # different length than the original inputs
# with(ts_data,
#      slide_period_dbl(.x = co2_conc, .i = datetime,
#                       .period = "minute", .f = mean, na.rm = TRUE)
#      )

# glimpse
glimpse(ts_data)
# compare a few values; convert to dataframe since R does not round/limit the
# displayed digits
head(select(ts_data, contains(c("datetime", "co2"))) %>% as.data.frame())
Next, let’s take a look at a tsibble approach for 5-minute averages. Notice that the size of the dataframe decreases.
#-----new timescales-----

ts_new <- ts_data %>%
  # get the "floor" of each datetime row (unfortunately `tsibble` doesn't let
  # us use "datetime" for this new variable name)
  index_by(datetime_new = floor_date(datetime, unit = "5mins")) %>%
  # summarise the mean of rows across all numeric dataframe columns
  # (use an anonymous function; passing `na.rm = TRUE` through `...` was
  # deprecated in dplyr 1.1.0)
  summarise(across(where(is.numeric), \(x) mean(x, na.rm = TRUE)),
            .groups = "drop") %>%
  # rename to get "datetime" variable name back
  rename(datetime = datetime_new) %>%
  # create variable so new 5 min averages are aligned (i.e., centered) with
  # originals for plots (because of `floor_date()` behavior)
  mutate(plot_time = datetime + minutes(2) + seconds(30))
Following up on our autocorrelation plot above, notice how it changes. The lags are now 5 minutes long, and evidence of autocorrelation appears only in the first 2-3 lags. This is a longer duration in time than we saw with the 1-second data, but fewer lags.
#---- 5-min autocorrelation ----

# check 5-minute average autocorrelation
ts_new %>%
  ACF(co2_conc) %>%
  autoplot()
All our example data are on the 1-second time scale; however, it is common to have multiple time scales in a suite of direct-reading instruments. Using the techniques above, you’ll be able to average different time intervals to a common schedule. Once the measurements are on a shared time scale, they can be merged/joined as we have shown in the “Merge & Inspect Instrument Data” section.
Plot Different Time Scales
Let’s inspect the time series of our different averages:
#-----plot different time averages-----

# we're going to use a simpler approach here and just manually specify each
# series rather than make a dataframe with all the plotting data combined
ggplot() +
  # plot original 1-second data
  geom_line(data = ts_data, aes(x = datetime, y = co2_conc, color = "blue")) +
  # plot `slider` 1-min period averages
  geom_line(data = ts_data, aes(x = datetime, y = co2_conc_one, color = "black")) +
  # plot `slider` 2-min moving averages
  geom_line(data = ts_data, aes(x = datetime, y = co2_conc_two, color = "red")) +
  # plot `tsibble` 5-minute block averages (note `x = plot_time`)
  geom_line(data = ts_new, aes(x = plot_time, y = co2_conc, color = "darkgreen")) +
  # specify legend values manually
  scale_color_identity(name = "Time Average",
                       breaks = c("blue", "black", "red", "darkgreen"),
                       labels = c("1-sec", "1-min", "2-min", "5-min"),
                       guide = "legend") +
  # labels
  labs(x = "Time", y = "CO2 (ppm)") +
  # theme
  theme_bw() +
  theme(legend.position = "bottom")
Map Concentrations
Our data are from a mobile monitoring campaign, so let’s map the concentrations.
In the following chunk we select the data corresponding to the geographic area of interest, define the bounding box (extent of the mapping area), and download a background (tile) map.
#-----get map objects-----

# WGS84 latitude-longitude. same as:
# "+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs"
latlong_proj <- 4326

# projection in meters we need for distance calculations
lambert_proj <- "+proj=lcc +lat_1=33 +lat_2=45 +lat_0=39 +lon_0=-96 +x_0=0 +y_0=0 +datum=NAD83 +units=m +no_defs"

# define the bounding box for the map
bbox <- ts_data %>%
  # first filter to time period (and therefore geographic area) of interest
  filter(between(datetime,
                 as.POSIXct("2018-07-19 13:13:00", tz = "US/Pacific"),
                 as.POSIXct("2018-07-19 16:32:00", tz = "US/Pacific"))
         ) %>%
  drop_na(latitude, longitude) %>%
  st_as_sf(coords = c('longitude', 'latitude'), crs = latlong_proj) %>%
  # add a buffer (in meters)
  st_transform(crs = lambert_proj) %>%
  st_buffer(1e4) %>%
  st_transform(crs = latlong_proj) %>%
  # then pipe into `st_bbox()`
  st_bbox()

# note you could write code to save this tile plot for offline use. This can
# have technical issues, however, so we won't do that here.
tiles <- maptiles::get_tiles(x = bbox, provider = "OpenStreetMap", zoom = 12)

basemap <- ggplot() +
  # add basemap tiles as background
  ggspatial::layer_spatial(tiles) +
  theme_minimal()

basemap
To help represent the time-varying nature of the data in this lab, let’s collect the times of some of our measurements to help our visualization.
#-----get time stamps-----

map_times <- ts_data %>%
  # convert to `tibble` because we're breaking up our `tsibble`
  as_tibble() %>%
  # filter to rows with :00, :15, :30, and :45 minute times
  filter(str_detect(time, pattern = ":00:00|:15:00|:30:00|:45:00")) %>%
  # make a map_time variable, dropping the seconds
  # (`$` indicates the character pattern should be at the end of the string)
  mutate(map_time = str_remove(time, pattern = ":00$"))
Map Carbon Dioxide Concentrations
#-----map concentrations-----

# map concentrations
map_co2 <- basemap +
  # map locations with concentrations
  geom_point(data = drop_na(ts_data, co2_conc),
             aes(x = longitude, y = latitude, color = co2_conc)) +
  # add measurement times to map
  geom_text_repel(data = map_times,
                  aes(x = longitude, y = latitude, label = map_time),
                  force_pull = -0.02
                  ) +
  # color scale
  scale_color_continuous(type = "viridis") +
  # labels
  labs(color = "CO2 (ppm)") +
  # bounding box limits for latitude and longitude
  coord_sf(xlim = c(bbox$xmin, bbox$xmax),
           ylim = c(bbox$ymin, bbox$ymax)) +
  # theme
  theme_void()

# show map
map_co2
Background Standardization
We show different approaches to estimating the background in this section and plot them for comparison.
Rolling Percentile & Rolling Minimum
Brantley et al., 2014 summarize a number of methods to calculate background concentrations. One approach described by Bukowiecki et al., 2002, includes taking the 5th percentile of 1 or 5 min averages. As we talked about in class, another approach is to calculate the moving minimum of 30-minute time intervals. slider simplifies these types of moving calculations.
#-----calculate background with moving functions-----

# specify the time in minutes for moving calculations
# (for percentile `_pct` and minimum `_min`)
time_pct <- 5
time_min <- 30

# specify the percentile considered to be background
pct <- 0.05

# use `slider` to add moving background calculations to `ts_data`. Then
# calculate an adjusted concentration by removing the background concentration.
# For moving percentiles, assign new names with suffix `_pct`, and for moving
# minimums use `_min`. `min()` will return Inf when all values are NA; use
# `na_if()` to avoid.
ts_data <- ts_data %>%
  mutate(
    # using moving percentiles
    co2_background_pct = slide_dbl(co2_conc,
                                   ~quantile(.x, probs = pct, na.rm = TRUE),
                                   .before = (time_pct*60)/2,
                                   .after = (time_pct*60)/2),
    co2_adj_pct = co2_conc - co2_background_pct,

    # using moving minimums
    co2_background_min = slide_dbl(co2_conc, ~min(.x, na.rm = TRUE),
                                   .before = (time_min*60)/2,
                                   .after = (time_min*60)/2),
    co2_background_min = na_if(co2_background_min, Inf),
    co2_adj_min = co2_conc - co2_background_min
    )
For the percentile approach we’ve taken 5 minutes as the time interval and 5th percentile of the data to calculate the background CO2 concentration. Let’s compare our two background estimation approaches in one plot:
#-----plot background-----

# plot the two background estimates on the same plot
ggplot(data = ts_data) +
  geom_line(aes(x = datetime, y = co2_background_pct, color = "black")) +
  geom_line(aes(x = datetime, y = co2_background_min, color = "red")) +
  labs(x = "Time", y = "CO2 Background (ppm)") +
  # specify legend values manually
  scale_color_identity(name = "Background Type",
                       breaks = c("black", "red"),
                       labels = c("Percentile", "Minimum"),
                       guide = "legend") +
  # theme
  theme_bw() +
  theme(legend.position = "bottom")
Loess Smoother of Background
We can further smooth our estimates of background. These background estimates, even when smoothed, appear to be picking up more than background. However, in a later plot we consider the background estimates in the context of the full data, which are much higher; the apparent variation in the background is therefore not that large in the context of the complete CO2 dataset.
#-----loess smooth of moving background-----

# specify loess models
mdl_pct <- loess(co2_background_pct ~ seq_along(time), data = ts_data,
                 span = 0.1, degree = 1)
mdl_min <- loess(co2_background_min ~ seq_along(time), data = ts_data,
                 span = 0.1, degree = 1)

# add smoothed predictions of background to `ts_data`
ts_data <- ts_data %>%
  mutate(co2_background_pctl = predict(mdl_pct, ts_data),
         co2_background_minl = predict(mdl_min, ts_data))

# prepare dataframe for ggplot
# (creating new columns for our different types of estimates)
ts_data %>%
  pivot_longer(contains("_background_")) %>%
  mutate(min_pct = if_else(str_detect(name, "min"), "minimum", "percentile"),
         smooth_un = if_else(str_detect(name, "l"), "smoothed", "unsmoothed")
         ) %>%
  # compare all background estimates
  ggplot(aes(x = datetime, y = value,
             linetype = interaction(smooth_un, min_pct, sep = " "),
             color = interaction(smooth_un, min_pct, sep = " ")
             )
         ) +
  # plot lines
  geom_line() +
  # set colors
  scale_color_manual(name = "Background Type",
                     values = c("red", "red", "black", "black")) +
  # set linetypes
  scale_linetype_manual(name = "Background Type",
                        values = c("dashed", "solid", "dashed", "solid")) +
  # labels
  labs(x = "Time", y = "CO2 Background (ppm)") +
  # theme
  theme_bw() +
  theme(legend.position = "bottom")
Plot Carbon Dioxide with Background
We’ll demonstrate with the rolling minimum approach, showing both the unsmoothed and loess smoothed versions.
#-----plot co2 and background-----

ggplot(data = ts_data) +
  # plot the co2 observations
  geom_point(aes(x = datetime, y = co2_conc), size = 0.8, color = "darkgray") +
  # plot co2 background using the loess smoothed and unsmoothed rolling minimums
  geom_line(aes(x = datetime, y = co2_background_minl),
            lty = "dashed", color = "red") +
  geom_line(aes(x = datetime, y = co2_background_min),
            lty = "solid", color = "red") +
  # labels
  labs(x = "Time", y = "CO2 with Background (ppm)") +
  # theme
  theme_bw()
Identifying Peak Concentrations
As we’ve seen, mobile monitoring data have a mix of temporal and spatial variability. In an effort to disentangle temporal confounding from spatial analysis, we can calculate the background concentration of a pollutant. Peak identification can then be performed on the background adjusted data to identify single or multi-pollutant extreme features (for example, departures above a particular quantile of single pollutant features vs. departures above background of a multivariate representation such as a PCA feature).
With background removed from the CO2 concentrations, we can now use the adjusted data to identify concentration peaks. Concentration peaks indicate potential sources of the pollutant. In this case, a CO2 peak could be a large vehicle like a truck.
#-----id peaks-----

# specify the time in minutes for the rolling window
time_window <- 1

# specify winsorize cut-point (percentile)
w_pct <- 0.95

# specify cutoff for peak change in slope values
pqv <- 0.6

# we can use `slider` to calculate these rolling statistics
# use the adjusted co2 based on the rolling minimum background (co2_background_min)
ts_data <- ts_data %>%
  mutate(
    # calculate adjusted CO2 concentration
    co2_adj = (co2_conc - co2_background_min),
    # identify the "next" value in the series with `lead()`
    co2_lag = lead(co2_adj),
    # calculate squared difference in consecutive CO2 concentrations
    co2_diff_sq = (co2_adj - co2_lag)^2,
    # limit extreme values by winsorizing
    co2_diff_sq = replace(co2_diff_sq,
                          co2_diff_sq > quantile(co2_diff_sq, w_pct, na.rm = TRUE),
                          quantile(co2_diff_sq, w_pct, na.rm = TRUE)),
    # calculate a 1 minute rolling mean of CO2 differences
    roll_co2_diff = slide_dbl(co2_diff_sq, ~mean(.x, na.rm = TRUE),
                              .before = (time_window*60)/2,
                              .after = (time_window*60)/2),
    # calculate peaks
    co2_is_peak = if_else(roll_co2_diff >= quantile(roll_co2_diff, pqv, na.rm = TRUE),
                          TRUE, FALSE)
    )
#-----plot co2 peaks-----

ggplot(ts_data) +
  
  # plot co2 observations
  geom_point(aes(x = datetime, y = co2_conc, color = co2_is_peak), size = 0.8) +
  
  # plot background
  geom_line(aes(x = datetime, y = co2_background_pct)) +
  
  # set colors
  scale_colour_manual(values = c("darkgray", "darkred")) +
  
  # labels
  labs(x = "Time", y = "CO2 Peaks with Background (ppm)") +
  
  # theme
  theme_bw() +
  theme(legend.position = "none")
Local Emission Plume Detection with P-Trak
As the CO2 map shows, the data we’ve collected represent concentrations over 4 hours, meaning some of the variability we’re observing could be a mix of spatial (i.e. the variability we see in the map) and temporal (i.e. the variability observed in time series plots) processes.
Next, we’ll explore the idea of pollutant emission ratios described by Kolb et al., 2004, which normalizes pollutant concentrations to CO2 concentrations, potentially helping identify emission sources in data with both temporal and spatial variability. We’ll apply a related ratio idea to the P-Trak data by plotting the proportion of the smallest particles (20-36 nm) relative to the total ultrafine particle (UFP) concentration.
#-----map emission.ratios-----

# calculate P-Trak ratio
ts_data <- ts_data %>%
  mutate(ptrak_ratio = (ptrak_noscreen_conc - ptrak_screen_conc) / ptrak_noscreen_conc)

# map ratios that are greater than zero
map_ratios <- basemap +
  
  # locations with concentrations
  geom_point(data = ts_data %>% drop_na(ptrak_ratio), 
             aes(x = longitude, y = latitude, color = ptrak_ratio)) +
  
  # color scale
  scale_color_continuous(type = "viridis", limits = c(0, 1)) +
  
  # labels
  labs(color = "P-trak Ratio \nscreened:unscreened") +
  
  # theme
  theme_void() 

# show map
map_ratios
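The chunk above maps a particle size ratio. A Kolb-style emission ratio normalizes a pollutant directly to CO2; a minimal sketch of that idea is below. This is a hypothetical extension, not part of the lab's code: the background-adjusted P-Trak series (`ptrak_adj`), the 5-minute window (`.before`/`.after` of 150 seconds at 1-second data), and the 5 ppm CO2 threshold are all assumptions you would want to tune. It reuses `co2_adj_pct` from the background chunk.

```r
#-----sketch: pollutant-to-CO2 emission ratio (hypothetical)-----

# background-adjust the unscreened P-Trak the same way as CO2 (rolling 5th
# percentile over an assumed 5-minute window), then take the ratio. Ratios are
# only meaningful when CO2 is clearly above background, so we keep only rows
# where the CO2 excursion exceeds an assumed 5 ppm threshold.
ts_data %>%
  mutate(ptrak_adj = ptrak_noscreen_conc - 
           slide_dbl(ptrak_noscreen_conc, 
                     ~quantile(.x, probs = 0.05, na.rm = TRUE), 
                     .before = 150, .after = 150), 
         emission_ratio = if_else(co2_adj_pct > 5, 
                                  ptrak_adj / co2_adj_pct, NA_real_)) %>%
  ggplot(aes(x = datetime, y = emission_ratio)) +
  geom_point(size = 0.5, alpha = 0.5) +
  labs(x = "Time", y = "UFP:CO2 emission ratio (pt/cc per ppm)") +
  theme_bw()
```

Elevated ratios flag plumes that are particle-rich relative to their CO2, which can help separate source types (e.g., diesel vs. gasoline exhaust) from general traffic congestion.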
Practice Session
Use the data from car1 or car2 to repeat a subset of the analyses demonstrated above for ultrafine particles and write up a coherent report about the patterns you see. The goal of this lab is for you to use the code we have provided to read in time-varying data from multiple instruments, check that they were read in appropriately, and address a few scientific questions about time-varying exposures. Here are the topics for you to address in your lab report:
Data description, data quality, and approach to analysis:
How many observations were available, and how many were usable after preliminary data management (e.g. dropping NAs after merging)?
What variables are available and which ones are of scientific interest?
When were the data collected and in what area?
What steps did you follow to accomplish the scientific analyses you discuss in the rest of the report?
Scientific analyses:
Plot a time series plot for the screened and unscreened P-Trak data on one plot. Do you see a different pattern for the two measurements of ultrafine particles?
Look at the autocorrelation in the unscreened P-Trak data. Compare autocorrelation plots at the 1-second time scale with autocorrelation plots at the 5-minute time scale. Describe the patterns and comment on their similarities and differences.
Choose one additional analysis that will allow you to highlight a feature of the time-varying data of scientific interest to you. Justify your reasoning for your choice, leveraging your understanding from the lecture and/or Brantley paper. For instance, you might consider comparing different background estimation approaches, showing a time series with background overlaid or removed, comparing the difference between the screened and unscreened P-Traks over space and/or time, or identifying peak concentrations. Present an appropriate choice of plots and/or maps to support your analysis. Interpret these and connect your interpretation to the material discussed in the lecture.
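As a starting point for the P-Trak tasks above, a minimal sketch follows the same patterns used in this lab's code. It assumes `ts_data` has been built as in the Code Appendix (with `ptrak_screen_conc` and `ptrak_noscreen_conc` columns) and that the `tidyr`, `ggplot2`, `tsibble`, `feasts`, and `lubridate` packages are loaded.

```r
#-----practice sketch (assumes ts_data from the Code Appendix)-----

# time series of screened and unscreened P-Trak on one plot
ts_data %>%
  as_tibble() %>%    # drop tsibble class before pivoting
  pivot_longer(cols = c(ptrak_screen_conc, ptrak_noscreen_conc), 
               names_to = "instrument", values_to = "concentration") %>%
  ggplot(aes(x = datetime, y = concentration, color = instrument)) +
  geom_line(alpha = 0.7) +
  labs(x = "Time", y = "Particle Count (pt/cc)", color = "P-Trak") +
  theme_bw()

# autocorrelation of the unscreened P-Trak at the 1-second scale
ts_data %>% ACF(ptrak_noscreen_conc, lag_max = 60) %>% autoplot()

# autocorrelation at the 5-minute scale (reusing the `ts_new` approach)
ts_data %>%
  index_by(datetime_5 = floor_date(datetime, unit = "5mins")) %>%
  summarise(ptrak_noscreen_conc = mean(ptrak_noscreen_conc, na.rm = TRUE)) %>%
  rename(datetime = datetime_5) %>%
  ACF(ptrak_noscreen_conc) %>%
  autoplot()
```

You will still need to interpret the plots and extend the sketch to the remaining report questions; treat the code as scaffolding, not an answer.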
Code Appendix
Session Information
#-----session info: beginning of Code Appendix-----

sessionInfo()
#-----setup-----

# set knitr options
knitr::opts_chunk$set(echo = TRUE)

# clear work space of all objects and unload all extra (non-base) packages
rm(list = ls(all = TRUE))
if (!is.null(sessionInfo()$otherPkgs)) {
    res <- suppressWarnings(
        lapply(paste('package:', names(sessionInfo()$otherPkgs), sep = ""), 
               detach, character.only = TRUE, unload = TRUE, force = TRUE))
    rm(res)
}

#-----load libraries-----

# Load pacman into memory, installing as needed
my_repo <- 'http://cran.r-project.org'
if (!require("pacman")) {install.packages("pacman", repos = my_repo)}

# Load the other packages, installing as needed.
pacman::p_load(knitr, dplyr, tidyr, readr, stringr, ggplot2, hms, lubridate, 
               purrr, tsibble, feasts, slider, 
               #ggmap, 
               ggspatial, maptiles, sf, # mapping 
               ggrepel)

#-----directory organization and read data-----

# specify data directory
data_dir <- file.path("Datasets", "mobile_monitoring")

# name destination file
dest <- file.path(data_dir, "Jul19MOV-UP.zip")

# Download files if not already present
if (!file.exists(dest)) {
  
  # create "Datasets" and "mobile_monitoring" directories if they do not
  # already exist
  dir.create(file.path("Datasets", "mobile_monitoring"), 
             showWarnings = FALSE, recursive = TRUE)
  
  # specify URL of zip file
  url <- "https://faculty.washington.edu/sheppard/envh556/Datasets/mobile_monitoring/Jul19MOV-UP.zip"
  
  # download zip
  download.file(url = url, destfile = dest)
  
  # unzip file
  unzip(zipfile = dest, exdir = data_dir)
  
  # # delete zip files (optional)
  # unlink(dest)
  # unlink(str_subset("__MACOSX", list.files(data_dir)), recursive = TRUE)
}

#-----datetime and date objects-----

# show POSIXct object -- a datetime object with both date and time
Sys.time()

# show class
# Note that 'POSIXt' is a virtual class which cannot be used directly. This
# virtual class exists from which both of the classes (POSIXct and POSIXlt)
# inherit. POSIXt is used to allow operations, such as subtraction, to mix the
# two classes.
class(Sys.time())

# show Date object -- date without time included
Sys.Date()

# show class
class(Sys.Date())

#-----gather file names for import-----

# get filename paths for data files
filenames <- list.files(data_dir) %>% str_subset("car1")

#-----gps-----

# Get GPS file name
gps_file <- str_subset(filenames, "GPS")

# read GPS data
gps <- read_csv(file.path(data_dir, gps_file)) %>%
  
  # set column names
  set_names(c("date", "time", "latitude", "longitude", "altitude", "speed")) %>%
  
  # create datetime column (notice we must specify the timezone)
  mutate(datetime = as.POSIXct(paste(date, time), tz = "US/Pacific"))

# show dataframe
gps

# and time zone
tz(gps$datetime)

#-----p-trak-----

# get p-trak not-screened file names
ptrak_noscreen_file <- filenames %>% 
  str_subset("PT") %>% 
  str_subset("scrnd|Scrnd|Screened|screen", negate = TRUE)

# read data
ptrak_noscreen <- read_tsv(file.path(data_dir, ptrak_noscreen_file), skip = 29) %>%
  
  # name columns
  set_names(c("date", "time", "ptrak_noscreen_conc")) %>%
  
  # modify date column and create datetime column (notice we must specify the
  # date format)
  mutate(date = as.Date(date, format = "%m/%d/%Y"), 
         datetime = as.POSIXct(paste(date, time), tz = "US/Pacific") 
         )

# get p-trak-screened file names and read in data
ptrak_screen_file <- filenames %>% str_subset("scrnd|Scrnd|Screened|screen")

# read data
ptrak_screen <- read_tsv(file.path(data_dir, ptrak_screen_file), skip = 29) %>%
  
  # name columns
  set_names(c("date", "time", "ptrak_screen_conc")) %>%
  
  # modify date column and create datetime column
  mutate(date = as.Date(date, format = "%m/%d/%Y"), 
         datetime = as.POSIXct(paste(date, time), tz = "US/Pacific") 
         )

# show dataframes - have PNC measurements every second
ptrak_noscreen
ptrak_screen

#-----co2-----

# get CO2 file name
co2_file <- str_subset(filenames, "CO2")

# read data (tab separated values), specify time as character because of time
# parsing issue
co2 <- read_tsv(file.path(data_dir, co2_file), skip = 1, 
                col_types = cols(`System_Time_(h:m:s)` = col_character()) ) %>%
  
  # name columns
  setNames(c("date", "time", "co2_conc", "h2o_conc", "temp", "pressure", 
             "co2_absorp", "flow") ) %>%
  
  # select columns of interest
  select(date, time, co2_conc) %>%
  
  # address time parsing (some minute values have only one digit) and 
  # create datetime column
  mutate(time = as_hms(time), 
         datetime = as.POSIXct(paste(date, time), tz = "US/Pacific") 
         )

# show dataframe - have CO2 measurements every second
head(co2)
str(co2$datetime)
tz(co2$datetime)

#-----merge datasets-----

# create a list of all data to merge
instrument_list <- list(gps = gps, 
                        ptrak_noscreen = ptrak_noscreen, 
                        ptrak_screen = ptrak_screen, 
                        co2 = co2)

# use `purrr::reduce` to run `full_join()` on listed items
all_data <- reduce(instrument_list, full_join) %>%
  
  # reorder columns so datetime is first
  relocate(datetime) %>%
  
  # arrange data in datetime order
  arrange(datetime)

# use glimpse to learn more about the data
glimpse(all_data)

#-----plot data-----

# note that these data are all for one day
plot_date <- first(all_data$date)

# transform data from wide to long for ggplot
all_data %>%
  pivot_longer(cols = contains("_conc"), 
               names_to = "pollutant", 
               values_to = "concentration") %>%
  
  # pipe into ggplot
  ggplot(aes(x = datetime, y = concentration)) +
  
  # plot each pollutant in a facet
  facet_wrap(~pollutant, ncol = 1, scales = "free_y") +
  
  # specify type of plot
  geom_line() +
  
  # specify theme
  theme_bw() +
  labs(title = plot_date)

#-----data availability plot-----

# transform data from wide to long for ggplot (using latitude for GPS)
all_data %>%
  mutate(GPS = latitude) %>%
  pivot_longer(cols = c(contains("conc"), GPS), names_to = "measurement") %>%
  
  # make plot
  ggplot(aes(x = datetime, y = measurement, color = factor(measurement)) ) +
  geom_point(size = 0.5, alpha = 0.4) +
  labs(x = "Time", y = "Instrument") +
  theme_bw() +
  theme(legend.position = "none")

#-----missing-----

lapply(all_data, function(i){
  tibble( 
    # sum missing
    n_miss = sum(is.na(i)), 
    
    # percent missing
    perc_miss = round(n_miss/length(i) * 100, 1) 
    ) 
  }) %>%
  
  # bind list
  bind_rows(.id = "variable")

#-----time stamps-----

# we'll use the instrument data list
lapply(instrument_list, function(i){
  
  # get vector of datetimes
  datetimes <- i[["datetime"]]
  
  # get a vector of the differences between consecutive measurements
  time_diff <- datetimes - lag(datetimes)
  
  # summarize differences
  table(time_diff)
  }) %>%
  
  # use list names
  set_names(names(instrument_list))

# Most measurements occur every second.
# note differences in the GPS - a couple of measurements appear to be
# duplicates. A handful of measurements have larger gaps.

#-----drop missing gps-----

all_data <- drop_na(all_data, latitude, longitude)

#-----try tsibble-----

# # use "try" here so document knits
# try( as_tsibble(all_data, index = datetime) )

# inspect duplicate rows
duplicates(all_data, index = datetime)

#-----to tsibble-----

# remove duplicate rows, and convert to `tsibble`
ts_data <- all_data %>%
  distinct(datetime, .keep_all = TRUE) %>%
  as_tsibble(index = datetime)

head(ts_data)

#-----fill ts-----

# count rows before filling
paste("number of rows before filling gaps:", nrow(ts_data))

# save for calculation below
temp <- ts_data

# Turn implicit missing values into explicit missing values (add NAs to data)
ts_data <- fill_gaps(ts_data) 

# # if you wanted to "fill" the new explicitly missing data with a value drawn
# # from the dataset, here is an option. (add a `%>%` to the function above). 
# # Note this could have downstream effects. 
# fill(co2_conc, .direction = "down")

# count rows after filling
paste("number of rows after filling gaps:", nrow(ts_data))

# see what rows were added - these contain NA values other than dates
differences_df <- anti_join(ts_data, temp)
head(differences_df)
paste("number of rows added:", nrow(differences_df))

#-----autocorrelation-----

# autocorrelation plot - shows the overall correlation structure across lags
## calculates correlation between a conc and its lagged versions (both
## immediate/direct and indirect)
## ACF measures the linear (Pearson correlation coef, R) correlation between a
## time series and its lagged versions in a tsibble.
## plot: A high positive correlation suggests that the values are similar
## between the time points separated by the lag. Values near zero indicate no
## correlation at that lag.
## The blue dashed lines mark the approximate 95% CI under the null hypothesis
## of no autocorrelation in the data.
ts_data %>%
  ACF(co2_conc, # column used for computation
      lag_max = 60 # maximum lag at which to calculate the acf
      ) %>%
  autoplot()

# partial autocorrelation plot
## shows only the correlation between a conc (e.g. at time t) and a lag conc
## (e.g. at time t-2), adjusting for the influence of intermediate conc lags
## (e.g. at time t-1).
## The first lag (lag 1) has a high and statistically significant partial
## autocorrelation (above the blue dashed line), indicating that the immediate
## prior value of the time series has a strong direct influence on the current
## value.
## After lag 1 (and possibly lag 2 or 4), the partial autocorrelations become
## less significant (fall within the blue dashed lines). Beyond the first lag,
## there is thus minimal direct influence of earlier values on the current
## value once the effects of lag 1 (and other significant lags) are accounted
## for.
ts_data %>%
  PACF(co2_conc, lag_max = 60) %>%
  autoplot()

#-----slider-----

ts_data <- ts_data %>%
  
  # left join new 1-minute period values
  left_join(
    
    # Approach 1: aggregate data to 1 min averages using fixed period intervals
    ## returns a dataframe with repeated 1 min avg measures every second
    slide_period_dfr(ts_data, ts_data$datetime, "minute", 
                     ~tibble(datetime = floor_date(.x$datetime), 
                             co2_conc_one = mean(.x$co2_conc) 
                             ) 
                     ), 
    by = "datetime" 
    ) %>%
  
  # add a datetime column with 1-minute precision to help if you choose to
  # `group_by()` and `summarise()` 1-minute averages
  mutate(datetime_one = ymd_hm(format(datetime, "%Y-%m-%d %H:%M"), 
                               tz = "US/Pacific")) %>%
  
  # Approach 2: calculate moving averages, e.g., 2 minutes to provide a
  # smoothed version of the time series (this works because we've made sure
  # our datetime index is unique and complete)
  mutate(co2_conc_two = slide_dbl(co2_conc, ~mean(.x, na.rm = TRUE), 
                                  .before = 60, .after = 60) ) 

# # slide_period also works in the same way, but it returns a vector of
# # different length than the original inputs
# with(ts_data, 
#      slide_period_dbl(.x = co2_conc, .i = datetime, 
#                       .period = "minute", .f = mean, na.rm = TRUE) 
#      )

# glimpse
glimpse(ts_data)

# compare a few values. convert to dataframe since R does not round/limit the
# displayed digits
head(select(ts_data, contains(c("datetime", "co2"))) %>% as.data.frame())

#-----new timescales-----

ts_new <- ts_data %>%
  
  # get the "floor" of each datetime row (unfortunately `tsibble` doesn't let
  # us use "datetime" for this new variable name)
  index_by(datetime_new = floor_date(datetime, unit = "5mins")) %>%
  
  # summarise the mean of rows across all dataframe columns
  summarise(across(where(is.numeric), mean, na.rm = TRUE), .groups = "drop") %>%
  
  # rename to get "datetime" variable name back
  rename(datetime = datetime_new) %>%
  
  # create variable so new 5 min averages are aligned (i.e., centered) with
  # originals for plots (because of `floor_date()` behavior)
  mutate(plot_time = datetime + minutes(2) + seconds(30))

# glimpse
glimpse(ts_new)

#---- 5-min autocorrelation ----

# check 5-minute average autocorrelation
ts_new %>%
  ACF(co2_conc) %>%
  autoplot()

#-----plot different time averages-----

# we're going to use a simpler approach here and just manually specify each
# series rather than make a dataframe with all the plotting data combined
ggplot() +
  
  # plot original 1-second data
  geom_line(data = ts_data, aes(x = datetime, y = co2_conc, color = "blue")) +
  
  # plot `slider` 1-min period averages
  geom_line(data = ts_data, aes(x = datetime, y = co2_conc_one, color = "black")) +
  
  # plot `slider` 2-min moving averages
  geom_line(data = ts_data, aes(x = datetime, y = co2_conc_two, color = "red")) +
  
  # plot `tsibble` 5-minute block averages (note `x = plot_time`)
  geom_line(data = ts_new, aes(x = plot_time, y = co2_conc, color = "darkgreen")) +
  
  # specify legend values manually
  scale_color_identity(name = "Time Average", 
                       breaks = c("blue", "black", "red", "darkgreen"), 
                       labels = c("1-sec", "1-min", "2-min", "5-min"), 
                       guide = "legend") +
  
  # labels
  labs(x = "Time", y = "CO2 (ppm)") +
  
  # theme
  theme_bw() +
  theme(legend.position = "bottom")

#-----get map objects-----

# WGS84 latitude-longitude. same as:
# "+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs"
latlong_proj <- 4326

# projection in meters we need for distance calculations
lambert_proj <- "+proj=lcc +lat_1=33 +lat_2=45 +lat_0=39 +lon_0=-96 +x_0=0 +y_0=0 +datum=NAD83 +units=m +no_defs"

# define the bounding box for the map
bbox <- ts_data %>%
  
  # first filter to time period (and therefore geographic area) of interest
  filter(between(datetime, 
                 as.POSIXct("2018-07-19 13:13:00", tz = "US/Pacific"), 
                 as.POSIXct("2018-07-19 16:32:00", tz = "US/Pacific")) ) %>%
  drop_na(latitude, longitude) %>%
  st_as_sf(coords = c('longitude', 'latitude'), crs = latlong_proj) %>%
  
  # add a buffer (in meters)
  st_transform(crs = lambert_proj) %>%
  st_buffer(1e4) %>%
  st_transform(crs = latlong_proj) %>%
  
  # then pipe into `st_bbox()`
  # make_bbox()
  st_bbox()

# note you could write code to save this tile plot for offline use. This can
# have technical issues, however, so we won't do that here.
tiles <- maptiles::get_tiles(x = bbox, provider = "OpenStreetMap", zoom = 12)

basemap <- ggplot() +
  
  # Add basemap tiles as background
  ggspatial::layer_spatial(tiles) +
  theme_minimal()

basemap

#-----get time stamps-----

map_times <- ts_data %>%
  
  # convert to `tibble` because we're breaking up our `tsibble`
  as_tibble() %>%
  
  # filter to rows with :00, :15, :30, and :45 minute times
  filter(str_detect(time, pattern = ":00:00|:15:00|:30:00|:45:00")) %>%
  
  # make a map_time variable, dropping the seconds
  # (`$` indicates the character pattern should be at the end of the string)
  mutate(map_time = str_remove(time, pattern = ":00$"))

#-----map concentrations-----

# map concentrations
map_co2 <- basemap +
  
  # map locations with concentrations
  geom_point(data = drop_na(ts_data, co2_conc), 
             aes(x = longitude, y = latitude, color = co2_conc)) +
  
  # add measurement times to map
  geom_text_repel(data = map_times, 
                  aes(x = longitude, y = latitude, label = map_time), 
                  force_pull = -0.02) +
  
  # color scale
  scale_color_continuous(type = "viridis") +
  
  # labels
  labs(color = "CO2 (ppm)") +
  
  # Bounding box limits for latitude and longitude
  coord_sf(xlim = c(bbox$xmin, bbox$xmax), 
           ylim = c(bbox$ymin, bbox$ymax)) +
  
  # theme
  theme_void() 

# show map
map_co2

#-----calculate background with moving functions-----

# specify the time in minutes for moving calculations
# (for percentile `_pct` and minimum `_min`)
time_pct <- 5
time_min <- 30

# specify the percentile considered to be background
pct <- 0.05

# use `slider` to add moving background calculations to `ts_data`. Then
# calculate an adjusted concentration by removing the background concentration.
# For moving percentile, assign new names with suffix `_pct`, and for moving
# minimums use `_min`. `min()` will return Inf when all values are NA; use
# `na_if()` to avoid.
ts_data <- ts_data %>%
  mutate( 
    # using moving percentiles
    co2_background_pct = slide_dbl(co2_conc, 
                                   ~quantile(.x, probs = pct, na.rm = TRUE), 
                                   .before = (time_pct*60)/2, 
                                   .after = (time_pct*60)/2), 
    co2_adj_pct = co2_conc - co2_background_pct, 
    
    # using moving minimums
    co2_background_min = slide_dbl(co2_conc, ~min(.x, na.rm = TRUE), 
                                   .before = (time_min*60)/2, 
                                   .after = (time_min*60)/2), 
    co2_background_min = na_if(co2_background_min, Inf), 
    co2_adj_min = co2_conc - co2_background_min 
    )

#-----plot background-----

# plot the two background estimates on the same plot
ggplot(data = ts_data) +
  geom_line(aes(x = datetime, y = co2_background_pct, color = "black")) +
  geom_line(aes(x = datetime, y = co2_background_min, color = "red")) +
  labs(x = "Time", y = "CO2 Background (ppm)") +
  
  # specify legend values manually
  scale_color_identity(name = "Background Type", 
                       breaks = c("black", "red"), 
                       labels = c("Percentile", "Minimum"), 
                       guide = "legend") +
  
  # theme
  theme_bw() +
  theme(legend.position = "bottom")

#-----loess smooth of moving background-----

# specify loess models
mdl_pct <- loess(co2_background_pct ~ seq_along(time), data = ts_data, 
                 span = 0.1, degree = 1)
mdl_min <- loess(co2_background_min ~ seq_along(time), data = ts_data, 
                 span = 0.1, degree = 1)

# add smoothed predictions of background to `ts_data`
ts_data <- ts_data %>%
  mutate(co2_background_pctl = predict(mdl_pct, ts_data), 
         co2_background_minl = predict(mdl_min, ts_data)) 

# prepare dataframe for ggplot
# (creating new columns for our different types of estimates)
ts_data %>%
  pivot_longer(contains("_background_")) %>%
  mutate(min_pct = if_else(str_detect(name, "min"), "minimum", "percentile"), 
         smooth_un = if_else(str_detect(name, "l"), "smoothed", "unsmoothed") 
         ) %>%
  
  # compare all background estimates
  ggplot(aes(x = datetime, y = value, 
             linetype = interaction(smooth_un, min_pct, sep = " "), 
             color = interaction(smooth_un, min_pct, sep = " ") 
             ) 
         ) +
  
  # plot lines
  geom_line() +
  
  # set colors
  scale_color_manual(name = "Background Type", 
                     values = c("red", "red", "black", "black")) +
  
  # set linetype
  scale_linetype_manual(name = "Background Type", 
                        values = c("dashed", "solid", "dashed", "solid")) +
  
  # labels
  labs(x = "Time", y = "CO2 Background (ppm)") +
  
  # theme
  theme_bw() +
  theme(legend.position = "bottom")

#-----plot co2 and background-----

ggplot(data = ts_data) +
  
  # plot the co2 observations
  geom_point(aes(x = datetime, y = co2_conc), size = 0.8, color = "darkgray") +
  
  # plot co2 background using the loess smoothed and unsmoothed rolling minimums
  geom_line(aes(x = datetime, y = co2_background_minl), lty = "dashed", color = "red") +
  geom_line(aes(x = datetime, y = co2_background_min), lty = "solid", color = "red") +
  
  # labels
  labs(x = "Time", y = "CO2 with Background (ppm)") +
  
  # theme
  theme_bw()

#-----id peaks-----

# specify the time in minutes for the rolling window
time_window <- 1

# specify winsorize cut-point (percentile)
w_pct <- 0.95

# specify cutoff for peak change in slope values
pqv <- 0.6

# we can use `slider` to calculate these rolling statistics
# use our previous adjusted co2 from the rolling 5th percentile (co2_adj_pct)
ts_data <- ts_data %>%
  mutate(
    # calculate adjusted CO2 concentration
    co2_adj = (co2_conc - co2_background_min),
    
    # identify the "next" value in the series with `lead()`
    co2_lag = lead(co2_adj), 
    
    # calculate squared difference in consecutive CO2 concentrations
    co2_diff_sq = (co2_adj - co2_lag)^2,
    
    # limit extreme values by winsorizing
    co2_diff_sq = replace(co2_diff_sq, 
                          co2_diff_sq > quantile(co2_diff_sq, w_pct, na.rm = TRUE), 
                          quantile(co2_diff_sq, w_pct, na.rm = TRUE)),
    
    # calculate a 1 minute rolling mean of CO2 differences
    roll_co2_diff = slide_dbl(co2_diff_sq, ~mean(.x, na.rm = TRUE), 
                              .before = (time_window*60)/2, 
                              .after = (time_window*60)/2),
    
    # calculate peaks
    co2_is_peak = if_else(roll_co2_diff >= quantile(roll_co2_diff, pqv, na.rm = TRUE), 
                          TRUE, FALSE) 
    ) 

#-----plot co2 peaks-----

ggplot(ts_data) +
  
  # plot co2 observations
  geom_point(aes(x = datetime, y = co2_conc, color = co2_is_peak), size = 0.8) +
  
  # plot background
  geom_line(aes(x = datetime, y = co2_background_pct)) +
  
  # set colors
  scale_colour_manual(values = c("darkgray", "darkred")) +
  
  # labels
  labs(x = "Time", y = "CO2 Peaks with Background (ppm)") +
  
  # theme
  theme_bw() +
  theme(legend.position = "none")

#-----map emission.ratios-----

# calculate P-Trak ratio
ts_data <- ts_data %>%
  mutate(ptrak_ratio = (ptrak_noscreen_conc - ptrak_screen_conc) / ptrak_noscreen_conc)

# map ratios that are greater than zero
map_ratios <- basemap +
  
  # locations with concentrations
  geom_point(data = ts_data %>% drop_na(ptrak_ratio), 
             aes(x = longitude, y = latitude, color = ptrak_ratio)) +
  
  # color scale
  scale_color_continuous(type = "viridis", limits = c(0, 1)) +
  
  # labels
  labs(color = "P-trak Ratio \nscreened:unscreened") +
  
  # theme
  theme_void() 

# show map
map_ratios

#-----session info: beginning of Code Appendix-----

sessionInfo()

#-----appendix code-----

#-----functions used in this Rmd-----

# Show the names of all functions used (loaded in the current environment)
lsf.str()

# Show the definitions of all functions loaded into the current environment
lsf.str() %>% set_names() %>% map(get, .GlobalEnv)
User-Written Functions Loaded in the R Markdown Environment
#-----functions used in this Rmd-----

# Show the names of all functions used (loaded in the current environment)
lsf.str()

# Show the definitions of all functions loaded into the current environment
lsf.str() %>% set_names() %>% map(get, .GlobalEnv)